kracktoo6581@floridapoly.edu
This mini-project provided several sets of unrelated data for exploring the tools and principles discussed so far in the course. The “babynames”, “Florida_Lakes”, and “atl-weather” datasets were studied to extract trends and determine relationships that may exist within the data. Visualizations were created to demonstrate the capacity for interactivity, spatial presentation, and modeling using the tools provided. For each dataset, the process used to manipulate and visualize the data is described first, followed by the findings, in the section corresponding to that dataset.
As some of the libraries used were shared by the different sections, all of the libraries will be imported before beginning.
library(tidyverse) # For ggplot2 and dplyr
library(stringr) # For string manipulation
library(plotly) # For interactive plots
library(sf) # For shapefiles
library(foreign) # For dbf files
library(broom) # For cleaning up the model created
To explore a set of data through an interactive visualization, the “babynames.rds” file was used. The data in this file lists the most popular names of babies born in the United States since 1880, including the number of babies given each name in each year. I chose to examine the frequency of names ending in the letter “a” across both sexes over time, as it is widely assumed that names ending in that letter tend to be female. I wanted to know whether this assumption holds up in the data or is only a matter of perception, and the data seemed to offer a conclusive answer. To examine the data in the described manner, the data file was first read into R using readRDS().
babyNames <- readRDS("data/babynames.rds")
The data was then cleaned by removing all names that did not end in the letter “a”.
aNames <- babyNames %>%
  filter(str_ends(name, "a"))
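As a quick check of what this filter keeps, here is a minimal sketch on a small made-up table. The names and counts in toyNames are hypothetical and are not taken from babynames.rds; only the column names are borrowed from it.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data, invented for illustration only
toyNames <- tibble(
  name = c("Anna", "Joshua", "Mary", "Ezra", "John"),
  sex  = c("F", "M", "F", "M", "M"),
  n    = c(10, 4, 12, 3, 9)
)

# str_ends() is case-sensitive: only names whose last character
# is a lowercase "a" are kept
toyANames <- toyNames %>%
  filter(str_ends(name, "a"))

toyANames$name
```

Names of either sex pass the filter, which is why the sexes still have to be separated in a later step.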
We want a dataframe that provides the number of babies of each sex whose names ended in “a” for each year. To obtain it, we can use dplyr to make two separate dataframes where one contains female “a” names for each year and the other contains male “a” names for each year. Then, we can use inner_join() to join them together by year.
# Total number of babies with male "a" names in each year
aNamesSummarizedMale <- aNames %>%
  filter(sex == "M") %>%
  group_by(year) %>%
  summarize(aCountMale = sum(n))
# Total number of babies with female "a" names in each year
aNamesSummarizedFemale <- aNames %>%
  filter(sex == "F") %>%
  group_by(year) %>%
  summarize(aCountFemale = sum(n))
# One row per year, with both counts side by side
aNamesSummarized <- inner_join(aNamesSummarizedFemale, aNamesSummarizedMale, by = "year")
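One thing to keep in mind is that inner_join() only keeps years that appear in both tables, so a year with no male “a” names at all would be dropped rather than shown as zero. Whether that ever happens in the real data is not something I verified. The sketch below shows an alternative that builds the same kind of table in one pipeline and fills missing combinations with 0. The toy data is hypothetical, and the resulting column names (aCountF, aCountM) differ from the ones used above.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical toy data: in 2001 no male name ends in "a"
toyNames <- tibble(
  year = c(2000, 2000, 2000, 2001, 2001),
  sex  = c("F", "M", "F", "F", "M"),
  name = c("Anna", "Joshua", "Mary", "Emma", "John"),
  n    = c(10, 4, 12, 8, 9)
)

toySummarized <- toyNames %>%
  filter(str_ends(name, "a")) %>%
  group_by(year, sex) %>%
  summarize(aCount = sum(n), .groups = "drop") %>%
  # Spread the sexes into columns; absent year/sex pairs become 0
  pivot_wider(
    names_from = sex,
    values_from = aCount,
    names_prefix = "aCount",
    values_fill = 0
  )

toySummarized
```

In this toy table the year 2001 is kept with a male count of 0, whereas joining two per-sex summaries with inner_join() would leave it out.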
Converting the ggplot object to a plotly visual allows the user to compare the two sets of data more directly through interaction. Setting hovermode = "x unified" in layout() makes a single hover label report both lines at the same x-position, as shown below.
aNamesPlot <- ggplot(data = aNamesSummarized) +
  geom_line(mapping = aes(x = year, y = aCountMale), color = "cyan") +
  geom_line(mapping = aes(x = year, y = aCountFemale), color = "pink") +
  labs(title = "Baby names ending in 'a' by year", x = "Year", y = "Number of babies with 'a' names") +
  theme_minimal()
ggplotly(aNamesPlot) %>%
  layout(hovermode = "x unified")
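Because the two line colors above are set outside aes(), the plot has no legend saying which line is which. A possible variation, sketched here on hypothetical toy data, keeps the counts in long format and maps color to sex so that a legend is generated. The toyLong values are invented, and this is only one way the plot could be arranged, not the version used for the findings.

```r
library(ggplot2)
library(plotly)

# Hypothetical long-format data: one row per year and sex
toyLong <- data.frame(
  year   = rep(2000:2003, times = 2),
  sex    = rep(c("F", "M"), each = 4),
  aCount = c(10, 12, 15, 14, 1, 2, 2, 3)
)

toyPlot <- ggplot(toyLong, aes(x = year, y = aCount, color = sex)) +
  geom_line() +
  # Same colors as above, now tied to the legend entries
  scale_color_manual(values = c(F = "pink", M = "cyan")) +
  labs(x = "Year", y = "Number of babies with 'a' names") +
  theme_minimal()

# Convert to plotly and show both series in one hover label
toyInteractive <- ggplotly(toyPlot) %>%
  layout(hovermode = "x unified")
```

Printing toyInteractive in an interactive session displays the widget.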
Hovering over any location on the combined plot shows the total number of babies of each sex born that year whose names ended in the letter “a”. As a result, I have to concede that names ending in that letter do in fact almost always indicate a female individual. The data shows that male names ending in “a” were all but nonexistent until the last few decades, and even since then their number has remained small. Also interesting, however, is that female names ending in that letter were once similarly uncommon. Their number appears to have increased significantly over the last century and has fluctuated widely since, peaking in 2006 and remaining relatively constant afterward.
The second set of data chosen included shapefiles for all of the lakes in the state of Florida. Additionally, the dataset maps each lake to the county with which it is associated. Given this information, I wanted to create an interactive visual that makes it easy to identify the lakes belonging to each county.
The data for shapefiles and county information were provided in two different files. These were likewise read into distinct variables.
lakeShapes <- read_sf("data/Florida_Lakes/Florida_Lakes.shp")
lakes <- read.dbf("data/Florida_Lakes/Florida_Lakes.dbf")
As the two sets of data were initially distinct, they were combined into a single dataframe for simplicity. This was done using the left_join() function and the pipe operator. The dataframes were joined by their “OBJECTID” values, as these were common to both dataframes and associated shapes with counties.
lakeMap <- lakeShapes %>%
  left_join(lakes, by = "OBJECTID")
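To illustrate how a join on “OBJECTID” behaves, here is a minimal sketch on made-up tables. The lake names, counties, and IDs are hypothetical. The second join shows what dplyr does when both tables carry the same non-key column, which is worth checking for in the real files before referring to a column such as COUNTY by name.

```r
library(dplyr)

# Hypothetical stand-ins for the shape table and the attribute table
toyShapes <- tibble(
  OBJECTID = c(1, 2, 3),
  NAME     = c("Lake A", "Lake B", "Lake C")
)
toyAttrs <- tibble(
  OBJECTID = c(1, 2),
  COUNTY   = c("POLK", "LAKE")
)

# left_join() keeps every row of the left table; unmatched rows get NA
toyJoined <- toyShapes %>%
  left_join(toyAttrs, by = "OBJECTID")

# When a non-key column exists in both tables, dplyr keeps both copies
# and renames them with the suffixes .x and .y
toyOverlap <- toyShapes %>%
  left_join(toyShapes, by = "OBJECTID")

names(toyOverlap)
```

In toyJoined, the third lake has no match and so its COUNTY is NA.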
ggplot() is used to generate the plot, while the geom_sf() layer allows shape information to be drawn in it. The map is displayed in the Mercator projection through coord_sf(). Basic styling was applied: the outline of each lake was kept thin so as not to crowd the plot, and discrete fill colors were used to distinguish counties. Finally, the plot was made interactive using ggplotly(). This interactivity includes the ability to zoom into any area and to select individual counties through the generated legend.
lakePlot <- ggplot() +
  geom_sf(data = lakeShapes, aes(fill = COUNTY), color = "black", size = 0.1) +
  coord_sf(crs = "+proj=merc") +
  scale_fill_discrete() +
  theme_light() +
  labs(title = "Florida lakes by county")
ggplotly(lakePlot)
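For readers without the shapefile, the same layers can be tried on a tiny hand-built sf object. This is only a sketch: the two square “lakes”, their coordinates, and the helper makeSquare() are hypothetical, and WGS 84 (EPSG:4326) is assumed as the coordinate reference system of the toy geometry.

```r
library(sf)
library(ggplot2)

# Hypothetical helper: a closed square polygon with its lower-left
# corner at (x0, y0), in longitude/latitude degrees
makeSquare <- function(x0, y0, side = 0.1) {
  st_polygon(list(rbind(
    c(x0, y0),
    c(x0 + side, y0),
    c(x0 + side, y0 + side),
    c(x0, y0 + side),
    c(x0, y0)
  )))
}

# Two made-up "lakes", each tagged with a county
toyLakes <- st_sf(
  COUNTY = c("POLK", "LAKE"),
  geometry = st_sfc(
    makeSquare(-81.9, 28.0),
    makeSquare(-81.7, 28.8),
    crs = 4326
  )
)

# Same layers as the lake plot: fill by county, Mercator projection
toyLakePlot <- ggplot() +
  geom_sf(data = toyLakes, aes(fill = COUNTY), color = "black") +
  coord_sf(crs = "+proj=merc") +
  theme_light() +
  labs(title = "Toy lakes by county")
```

Printing toyLakePlot draws the static map, and it can be passed to ggplotly() in the same way as the full lake plot.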